Background

Loan default prediction is important because it helps lenders and financial institutions to assess the risk of granting loans to borrowers. By predicting the likelihood of a borrower defaulting on a loan, lenders can make informed decisions on whether to approve the loan, how much to lend, and at what interest rate. This helps to reduce the risk of financial losses due to default and can improve the overall stability of the financial system.

Hide Data

Show Data

This data set was collected from Github repository. In the case of this data the default column: 1 means they paid off their loan and 0 is the opposite. There are about 39,685 data points within this table.

Select Columns

selected <- default2 %>%
  select(default, loan_amnt, annual_inc, income_loan_ratio)
datatable(selected, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Full Data

datatable(default, options=list(lengthMenu = c(3,10,30),scrollY=300,scroller=TRUE,scrollX=TRUE), 
            extensions="Scroller")
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Logistic Model

The probability of the loaning being paid back is based on the ratio of money loaned out to the debtors annual income is given by the following logistic regression model.

\[ P(Y_i = 1|x_i) = \frac{e^{\beta_0+\beta_1 x_i}}{1+e^{\beta_0 + \beta_1 x_i}} = \pi_i \]

In this model, for each previous shuttle launch \(i\):

If \(\beta_1\) is zero in the above model, then \(x_i\) (ratio of money loaned) provides no insight about the probability of repayment. If one however, then the ratio plays an important role in the probability of repayment. Using a significance level of \(\alpha = 0.05\) we will test the below hypotheses about \(\beta_1\).

\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]

Fitting the Model

The estimates of the coefficients \(\beta_0\) and \(\beta_1\) for the above logistic regression model and data are shown below.

default.log <- glm(default ~ income_loan_ratio, data= default2, family=binomial)
summary(default.log) %>% pander()
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.284 0.0283 80.69 0
income_loan_ratio -2.441 0.1147 -21.28 1.641e-100

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 32489 on 39684 degrees of freedom
Residual deviance: 32053 on 39683 degrees of freedom


This gives the estimated model for \(\pi_i\) as \[ P(Y_i = 1|x_i) \approx \frac{e^{2.284-2.441x_i}}{1+e^{2.284 - 2.441x_i}} = \hat{\pi}_i \] where \(b_0 = 2.284\) is the value of the (Intercept) which estimates \(\beta_0\) and \(b_1 = -2.441x_i\) is the value of income_loan_ratio which estimates \(\beta_1\).

Importantly, the \(p\)-value for the test of income_loan_ratio shows a significant result \((p = 1.641e-100)\) giving sufficient evidence to conclude that \(\beta_1 \neq 0\). The loan income ratio effects the probability of the loan being repaid.

Visualizing the Model

palette(c("purple3", "grey22"))
plot(default ~ income_loan_ratio, data=default2, xlab="Loan Percent of Income ", ylab="Repayment Probability", main="Probability of Repayment", col= as.factor(default), pch=18)
b <- coef(default.log)
curve(exp(b[1]+b[2]*x)/(1+exp(b[1]+b[2]*x)), add = TRUE)
legend("topright", col=palette(), pch=18, legend=c("Default", "Repaid"), bty="n", text.col = palette())

Looking at the plot above you can see that the chance of the loan being paid back decreases as the percent of income increases.

Diagnosing the Model

To demonstrate that the logistic regression is a good fit to these data we apply the Hosmer-Lemeshow goodness of fit test (since there are only a couple repeated \(x\)-values) from the library(ResourceSelection).

Test statistic df P value
15.36 8 0.05254

Since the null hypothesis is that the logistic regression is a good fit for the data, we claim that the logistic regression is appropriate (p-value = 0.05254).

Prediction

For a hypothetical situation imagine my income is $10k as a college student and I want to buy a used car for about $5K. That means that my Loan Percent of Income would be about 0.50 the prediction for such would come out to:

predicted <- predict(default.log, newdata= data.frame(income_loan_ratio = 0.5), type="response")
predicted %>% pander()
1
0.7433

The probability of me paying back the loan is about \(74.33%\).